The decision by the National Institute of Education panel on the effects
of school desegregation to select (for meta-analysis) a small group of
preferred studies based upon criteria chosen in advance of examining the
studies was, in principle, a mistake. One usually cannot know until the
data have been examined which of several competing methodological
criteria are most important. In the case of the effects of desegregation
on minority achievement, Crain and Mahard in their 1982 review of 93
desegregation studies found a methodological error so specific to
desegregation research that it was not even recognized as an error until
the review was done. The error was that studies of the effects of
desegregation on minority achievement will underestimate any effects
when using subjects who have not been in desegregated settings since
kindergarten or Grade 1. Whereas Crain and Mahard found 20 studies of
blacks in desegregated settings since kindergarten or Grade 1, the panel
discarded all but one of them because they did not fit their
chosen-in-advance criteria. Of the 20 studies identified by Crain and
Mahard, 16 showed consistent positive outcomes and only 2 were negative.
If the principal function of selecting a superior subgroup of studies is
to find the consistency of results which is masked by error in an
unselected sample, Crain and Mahard succeeded, and the panel did not.
(CMG)
Dilemmas in Meta-Analysis: A Reply to Reanalyses of the
Desegregation-Achievement Synthesis



Robert L. Crain

The Rand Corporation and
The Center for Social Organization of Schools

Johns Hopkins University

In this volume a group of scholars have come together to assess the
state of our knowledge about the effects of school desegregation on black
achievement test scores. The scholars were selected to represent a range
of personal ideologies. Thus this project should provide a near-perfect
opportunity to array a group of social scientists along a continuum from
left to right and demonstrate that the scientific conclusions they draw are
consonant with their personal politics. Doing so would present strong
evidence that our worst fear is true: that social science is not really
science, and government, in employing social science, has merely been
financing propaganda. Perhaps one can draw this conclusion from the panel's
work, but I don't think so.

First, it is not so easy to attach political positions to working social
scientists. It makes good sense to classify me as a "liberal;" I have
testified in a number of court cases, and while this has sometimes been as
a court-appointed expert or on behalf of a school board resisting desegregation,
it has usually been as an expert called by the plaintiffs in a suit trying
to bring about desegregation. Other members of this panel have testified
for school boards resisting desegregation or have been called to present the
anti-busing position in congressional hearings. But in at least two cases
putting labels on members of the panel is not so easy to do. Paul Wortman
was selected as a liberal mainly because he had completed a literature
review showing positive effects of desegregation on black achievement;
and Walter Stephen was selected as a "neutral" because he is the author of
an earlier review concluding that there were few positive effects of
desegregation. But every scientist whose data support a liberal position is
not necessarily a liberal, just as every scientist who agreed with Copernicus
was not anti-Christian.

It is also not so easy to show a correlation between personal ideology
and scientific position. It is true that I, the obvious liberal on the panel,
am the co-author of a literature review (Crain and Mahard, 1982) arguing that
desegregation seems to raise black achievement by .3 standard deviations,
a larger estimate than any other member of the panel has made; and the panel's
most obvious conservative, David Armor, has produced the smallest estimated
achievement effect of any member of the panel. But if political position
were dominant here, its effect would have to appear in the way the panel
selected the 19 studies it considered best. Paul Wortman read the studies
gathered by Mahard and me (1982) and by Krol (1978) and recommended to the
panel a group of 31 studies as being of superior quality; the 18 that the
panel chose to accept from that offering are in fact only slightly less
positive in their assessment of desegregation than the ones they declined
to use. There is little evidence of bias in their choice. It is true that
when the panel veered from its normal course of using only the data provided
by Wortman, it did so to add one study which had found a negative effect
of desegregation and to add additional data strengthening a second study
in the group of 18 which had found a negative effect. But this is not very
strong evidence for an ideological interpretation of the actions of the
authors. Finally, one might simply note that when the liberals, Crain and
Mahard, reviewed the literature on desegregation they gathered together
93 studies in which the average effect of desegregation on black achievement
was +.08 standard deviations, pooling reading and math effects together; the
conservative David Armor reviewed 15 studies and found an effect on reading
scores of +.11 and on math scores of .00, an average of .055. One can hardly
believe that approximately 180° of political ideology are accurately
translated into the selection of two samples whose results differ
by only .025 standard deviations.
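
The arithmetic behind this comparison is easy to verify. The following is
a minimal sketch that uses only the figures quoted in the text; the variable
names are illustrative and not part of either review.

```python
# Figures quoted in the text, in standard deviation units.
crain_mahard_pooled = 0.08   # 93 studies, reading and math pooled
armor_reading = 0.11         # Armor, 15 studies, reading scores
armor_math = 0.00            # Armor, 15 studies, math scores

# Armor's two estimates averaged, then compared with the pooled figure.
armor_average = (armor_reading + armor_math) / 2
difference = crain_mahard_pooled - armor_average

print(f"Armor average: {armor_average:.3f}")
print(f"Difference between the two reviews: {difference:.3f}")
```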

Ideology does appear in some of the essays in this volume, including
this one; but it tends to show up mostly in the conclusions and
interpretations, in the words rather than the numbers. One reason it does not
show in the numbers is that it is very difficult for contemporary social
scientists to disagree about methodology. The technique used here for assessing
effect size was proposed by Wortman as neither a liberal nor a conservative
solution; it was accepted by all the members of the panel regardless of
personal ideology.

But this is not to say that there are no differences worth noting among
the panelists, or that these differences have no consequences. There is an
important division among the members of the panel, but on a methodological,
not ideological, issue: the question of whether one, in reviewing literature,
should select only the better studies and concentrate on them, or review
all the studies one can find. There is in this panel a rather neat correlation
between the number of studies one chooses to look at and the size of the
effect of desegregation one finds. Crain and Mahard, using 93 studies,
conclude that desegregation raises black achievement something on the order of
1/4 to 1/3 of a standard deviation. Wortman, reviewing 31 studies, concludes
that the gain is perhaps 1/5 of a standard deviation. The others, using 19
or fewer studies, conclude that desegregation raises black achievement by
perhaps 1/8 of a standard deviation or perhaps less. I would like to argue
that in this particular case it is not an accident that the number of studies
reviewed is related to the conclusions drawn.

The question of whether one should selectively review literature or
review all of it has been a subject of considerable debate among scientists
using what is now called meta-analysis, the computer-assisted review
of studies of a particular question. At first thought, the argument that
one should choose the best studies and leave the chaff aside seems unquestionably
the right answer. Certainly the counterargument that one should include all
the studies because with a large enough sample of studies errors will cancel
themselves out and reveal the truth seems quite inadequate.

Selection of the good studies seems like the obvious answer only as
long as we sleepily think that our task is only to find the competent evaluations
of a particular program and compute an overall average program effectiveness
score. Most of the meta-analyses done to date and most of the literature
reviews discussed by Herbert Walberg in this volume are in fact of this type,
but there is no reason they must or should be this simple. First, one often
wants to know more about a new intervention than simply whether it works;
we often need to know how and why as well. And even if we only want to know
whether there is an overall treatment effect, there are better ways than
throwing away most of the research. Suppose there are 100 studies of an
innovation. Rather than choosing the ten supposedly best studies and
computing an average effect, one might analyze all 100 and examine the 10
separately. One might evaluate all 100 studies and assign weights, as in
survey research, to the studies which are particularly weak or
strong; rather than counting each study equally, one would count
particularly weak studies as being only a fraction of the better studies.
Alternately, one might do as Mahard and I did and construct an additive
model, assuming that studies of a particular kind overpredict or
underpredict the treatment effect by an amount one can
estimate through some statistical procedure. All three of these alternatives
are ways of emphasizing the best studies after an empirical analysis of all
of them. All else equal, of course we would prefer to select the best
studies from a group through an empirical analysis rather than from an
a priori judgment.

Viewed this way, the only argument in favor of prior selection is that
of efficiency. In many cases this can be a convincing argument. With limited
resources one cannot afford to spend vast amounts of time going through
dozens of weak studies in order to gain a modest amount of information.
Given the short duration of this project, it might have been impossible for
the panel to review all 100-odd studies of desegregation and black achievement.
Perhaps selecting a small group was the only workable plan. But this does
not mean that it was a good plan.

In this paper we will argue, first, that selection of a small group
of preferred studies from a pool using criteria chosen in advance of examining
the studies is in principle a mistake. We will then go on to show that in
this case a mistake in principle was also a mistake in practice: the panel,
in selecting 19 studies from the pool of 100, made a serious error.

The Theoretical Problems with Prior Selection

The analogy to weighting in survey research is useful. In surveys, it
is often the case that particular classes of respondents are especially
valuable for analysis, and these respondents are oversampled. However, the
total sample is then no longer representative of the general population.
The solution is to assign a weight, a multiplier, to each of the oversampled
cases so that if three times as many cases in one particular class are
selected, each is treated as only 1/3 of a case in the final analysis. The
selection of some studies to include in a meta-analysis while others are
rejected is essentially a decision to assign a weight of 1 to some studies
and a weight of 0 to all others. The simplest way to justify doing so is to
divide the studies into a small number of discrete categories, arguing that
every study in certain categories is worth examining while none of the
studies in the other categories is. Unfortunately, anyone who has read
literature such as the desegregation-achievement material knows how difficult
it would be to justify doing this.
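
The weighting analogy can be made concrete with a short sketch. The
function names and the five effect sizes below are hypothetical and chosen
only for illustration; they are not data from any study discussed here.
The sketch shows that prior selection is the special case in which every
weight is either 1 or 0.

```python
def survey_weights(sampling_rates):
    """Weight each class by the inverse of its relative sampling rate."""
    return {cls: 1.0 / rate for cls, rate in sampling_rates.items()}


def weighted_mean(values, weights):
    """Weighted average of values; fails loudly if nothing is weighted in."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all weights are zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# A class sampled at three times the normal rate counts as 1/3 of a case.
weights = survey_weights({"ordinary": 1.0, "oversampled": 3.0})

# Hypothetical effect sizes for five studies (illustrative only).
effects = [0.30, 0.10, 0.05, 0.20, -0.05]
include_all = [1, 1, 1, 1, 1]        # every study counts equally
prior_selection = [1, 0, 0, 1, 0]    # "preferred" studies get 1, the rest 0

print(weights["oversampled"])
print(weighted_mean(effects, include_all))
print(weighted_mean(effects, prior_selection))
```

Any intermediate scheme, in which weaker studies count as some fraction
of a stronger one, fits between these two extremes.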

If one does not accept the idea that the studies can be neatly divided
into two discrete categories, one good and one bad, then a more systematic
approach is to rank the studies by quality, putting the best studies at the
top of the list and then moving down the list until we find an appropriate
cut-off point so we can discard studies below a certain level of quality.
There are several problems with this approach. The first is that study
quality is a multi-dimensional concept; a study which is good in one
respect may not be in another. Even if studies that are good in one respect
tend to be better than average in others, how does one choose to rank one
study which is very good in category A and only moderately good in category
B above or below another study which is very good in B and only above average
in A? While I have not attempted a formal proof, I believe that the Arrow
paradox (1951) can be used to show that such a ranking is impossible unless
one is willing to assign definite numeric values to, for example, the
relative merits of increasing the sample size versus using a pretest measure
of higher reliability. If it is not possible for one person to rank the
studies unequivocally from best to worst, it is certainly impossible for a
group of scholars to do so, meaning that one cannot expect the readers of
a meta-analysis to agree with the author that the right decision has been
made about study selection.
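
The dependence of a ranking on the relative value placed on each
criterion can be illustrated with a small sketch. The two studies, their
scores, and the weights are all hypothetical; the point is only that the
ordering reverses when the weights change, so no ranking exists until
definite numeric values are assigned.

```python
# Hypothetical quality scores (0 to 10) on two criteria; illustrative only.
studies = {
    "study_1": {"A": 9, "B": 5},   # very good on A, moderately good on B
    "study_2": {"A": 6, "B": 9},   # very good on B, above average on A
}


def rank(studies, weight_a, weight_b):
    """Return study names ordered best-first under the given weights."""
    def score(name):
        return weight_a * studies[name]["A"] + weight_b * studies[name]["B"]
    return sorted(studies, key=score, reverse=True)


ranking_if_a_matters_more = rank(studies, weight_a=2, weight_b=1)
ranking_if_b_matters_more = rank(studies, weight_a=1, weight_b=2)

print(ranking_if_a_matters_more)
print(ranking_if_b_matters_more)
```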

At this point the reader may argue that I am being a bit pedantic;
that all science is imperfect, and more importantly, is dependent on scarce
resources. With only a certain amount of money and time available, one should
not spend it rooting through hundreds of useless studies, carefully recording
all their faults. If one used the weighting procedure suggested earlier,
one would have to read each study, enter its data into the computer, and
perhaps compute weights designed, for example, to minimize the variance in
the overall estimate by assigning low weights to classes of studies which
have relatively large variability in their estimates of treatment effect.
Alternately, if one uses the algebraic model that Crain and Mahard used,
one must run regression equations trying to estimate the proper amount to
add or subtract from the treatment effects generated by studies of a
particular kind. All of this takes time and money away from the main
objective, which presumably is to find the best studies and see what they
say.
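
Both procedures can be sketched in a few lines. The data below are
hypothetical, and the techniques are stand-ins chosen for illustration:
inverse-variance weighting is assumed as one common way to give low weight
to highly variable classes, and a simple difference of class means is used
in place of the regression equations, which are not reproduced in this
paper.

```python
from statistics import mean, variance

# Hypothetical effect sizes grouped by class of study (illustrative only).
classes = {
    "longitudinal": [0.10, 0.12, 0.08, 0.10],
    "norm_controls": [-0.30, 0.40, 0.10, -0.20],
}


def inverse_variance_pooled(classes):
    """Pool class means, weighting each by 1 / (variance of that mean)."""
    weighted_sum = 0.0
    total_weight = 0.0
    for effects in classes.values():
        # Variance of a class mean = sample variance / number of studies.
        var_of_mean = variance(effects) / len(effects)
        weight = 1.0 / var_of_mean
        weighted_sum += weight * mean(effects)
        total_weight += weight
    return weighted_sum / total_weight


def additive_adjustment(classes, reference):
    """Amount by which each class mean differs from the reference class."""
    base = mean(classes[reference])
    return {name: mean(effects) - base for name, effects in classes.items()}


pooled = inverse_variance_pooled(classes)
unweighted = mean(mean(effects) for effects in classes.values())
offsets = additive_adjustment(classes, reference="longitudinal")

print(f"unweighted mean of class means: {unweighted:.3f}")
print(f"inverse-variance pooled estimate: {pooled:.3f}")
print(f"estimated offset for norm_controls: {offsets['norm_controls']:.3f}")
```

In this toy example the noisy class barely moves the pooled estimate, and
its offset is the amount one would add back to studies of that kind.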

It seems to me that the best way to settle this argument is empirically.
We have here an example of each kind of research. Can we compare them
and conclude whether the selection of a small number of supposedly better
studies is a wiser strategy than a brute force analysis of the entire
literature?

The Real-World Problems with Prior Selection of Desegregation Studies

The problem with selecting the best studies of desegregation and black
achievement is not only that the multiple criteria which can be used for
selection are imperfectly correlated; the criteria are in fact negatively
related. The data which Mahard and I assembled on the 93 studies
demonstrate this. Methodologically superior studies presumably have larger
sample sizes and longitudinal research designs, and evaluate situations which
more accurately represent the policy being investigated. In this case, more
recent desegregation plans are more interesting to study than earlier
desegregation plans because they presumably represent contemporary policy
more accurately; and the students being studied should be students who
have experienced desegregation from kindergarten or first grade, since that
is the way desegregation is done in perhaps 95% or more of all desegregation
plans in the United States. Table 1 shows the interrelations among
these four criteria. The correlations are, on the whole, negative. Studies
which have large sample sizes tend not to be longitudinal. The more recent
the desegregation plan being studied, the less likely it is that the study

Table 1: Correlations among Study Methodological Attributes
         and Study Outcomes

                              Samp.   Longit.  Late Date  Early Grade  Effect
                              Size    Design   Deseg.     Deseg.       Size

"Quality"
  Sample Size (Large)                 -.23*     .33*       -.10        -.04
  Longitudinal Design (Yes)   -.23*             .03        -.05         .13*

"Representativeness"
  Date of Deseg. (Later)       .33*    .03                 -.19*       -.08
  Grade Deseg. began
    (at early grade)          -.10    -.05     -.19*                    .24*

Outcome: Effect Size (+)      -.04     .13*    -.08         .24*

is of students who were desegregated at kindergarten or first grade.
(This latter negative correlation is almost a necessity since a recent
desegregation plan has not had time for its youngest students to reach an
age where they can be easily tested.) If one wants to choose the best
studies from among this field, there are hard trade-offs to be made.
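
The pattern among the four selection criteria can be tallied directly from
the entries of Table 1. The sketch below is only a bookkeeping aid; the
short variable labels are my own, and significance markers are omitted.

```python
from itertools import combinations

labels = ["sample_size", "longitudinal", "late_date", "early_grade",
          "effect_size"]

# Pairwise correlations as read from Table 1.
r = {
    ("sample_size", "longitudinal"): -0.23,
    ("sample_size", "late_date"): 0.33,
    ("sample_size", "early_grade"): -0.10,
    ("sample_size", "effect_size"): -0.04,
    ("longitudinal", "late_date"): 0.03,
    ("longitudinal", "early_grade"): -0.05,
    ("longitudinal", "effect_size"): 0.13,
    ("late_date", "early_grade"): -0.19,
    ("late_date", "effect_size"): -0.08,
    ("early_grade", "effect_size"): 0.24,
}

# The four selection criteria exclude the outcome variable.
criteria = labels[:4]
criterion_pairs = list(combinations(criteria, 2))
negative_pairs = [pair for pair in criterion_pairs if r[pair] < 0]

print(f"{len(negative_pairs)} of {len(criterion_pairs)} criterion pairs "
      "are negatively correlated")
for pair in negative_pairs:
    print(pair, r[pair])
```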

. HnnR between the various 
The last line of Table 1 shows the correlations between the various
methodological dimensions and the overall effect size. We know that most
studies of desegregation show a positive effect. If longitudinal designs
are more accurate, it follows that there should be a significant positive
correlation between using a longitudinal design and the magnitude of the
treatment effect. Wortman notes this, pointing out that the average
treatment effect of the thirty-one studies he selected is considerably
higher than the average treatment effect of the pool of 93 which Crain and
Mahard used. But by the same criteria, since desegregation normally begins
at kindergarten or first grade, and there is a strong positive correlation
between the grade where desegregation is begun and the treatment effect
(see the lower right of Table 1), it follows that the grade at which
desegregation began is also an important selection criterion. It would be
extremely difficult to have anticipated this in advance of seeing this
correlation.

But the problem is serious. Imagine that a desegregation plan is adopted in
a school district and a researcher decides to evaluate it. Chances are
he or she will choose to study the plan during its first year or
two. The researcher will not want to wait until the plan has been in place
for a decade and is no longer of policy interest or newsworthy. The
chances are also good the researcher will do the evaluation by studying the
test performance of students in the middle elementary grades. These are
the youngest grades where students can be easily and accurately tested. In
a typical design, the students will have attended segregated schools until the
end of second grade, be pretested, transfer to desegregated schools, and
be posttested a year later. This is a very clean design, resembling a
laboratory experiment. But it is not a study of the right problem. The
experience of the students being studied, segregation for three years followed
by one year of desegregation, is quite atypical, a transitory stage in the
school district's desegregation process. Their younger siblings and all
future students in this school system will have four years of desegregation
at the end of grade three. And according to Table 1, their achievement gains
as a result of desegregation will be considerably more positive than those
of the students being studied by this (or most) researchers. The 93 studies
Mahard and I located included 295 samples of students; of these four-fifths
received a mixed schooling, partly segregated and partly desegregated.

This illuminates the main problem with the prior selection approach:
that it assumes that the methodological criteria which define a good study
are known in advance. This is an assumption we normally take for granted.
We know what sort of design is superior and what sort inferior and therefore
can make an a priori decision about the quality of any particular study.
However, it is unlikely that in practice we can ever actually do this. First
of all, one usually cannot know until the data have been examined which
of several competing methodological criteria are most important. If there
are various threats to validity, the importance of any particular threat
depends a good bit upon the particular type of research being done. For
example: if achievement test scores are the dependent variable, then
reliability of pretest and posttest measures is likely to be less of a problem
than if the study deals with measurement of psychological attitudes.
Second example: studies of student absenteeism based on official reports
are likely to be reasonably accurate and one might choose to ignore those
studies based on self-reported absenteeism. At the same time, a study of
juvenile delinquency might choose to include the studies using self-reported
delinquency and exclude studies using delinquency reported by official sources
on the grounds that official reports of delinquency are notoriously inaccurate.
The same criteria are applied in directly opposite ways in two studies depending
upon the subject being studied.

In the case of the effects of desegregation on minority achievement
we have found a methodological error, studying students whose education
was a mixture of segregation and desegregation, which is so
specific to desegregation research that it was not even recognized as an
error and source of bias until our review was done. Table 1 suggests that
studies of the effects of desegregation on minority achievement which use
as subjects students who have not experienced the
treatment beginning in kindergarten or grade 1 underestimate the
effects of desegregation. One might assume that such an error would be quite
rare, since virtually every desegregation plan in the United States begins
in kindergarten or grade 1 at the latest. However, a large majority of
researchers who have studied the effects of desegregation committed this
error of studying students whose desegregation began not in the normal
fashion at the beginning of their entry into school, but only after they
had received some education in segregated schools, and the reason they have
done so is obvious: they wanted to publish quickly on this timely topic,
and they wanted to study students who were old enough to be reliably tested.

The panel, in selecting the nineteen studies which they considered to
be methodologically superior, did not require that the students being studied
have a desegregation experience beginning in kindergarten or first grade.
They used instead various other criteria, including that the study be
longitudinal; and herein lies the problem. Table 2 shows the relationship
between design type and grade at which students are desegregated. Only 18%
(two studies) of the studies of students desegregated at kindergarten are
longitudinal. The reason is obvious: it is difficult to pretest students who
have not yet learned to read. And neither of these two studies was selected
by the panel. The second column shows the percentage of studies at each grade
selected by the panel. Mahard and I found a total of twenty studies of
desegregated black students with desegregation beginning in kindergarten or
first grade and which contained a segregated black control group. The panel
used the data from only one of these studies. The remaining nineteen
studies were discarded, usually because these very young children did not
provide accurate pretests for longitudinal analysis. Eight of the twenty
studies we identified used cohort comparison: comparing the scores of
kindergarten and first grade students after desegregation to the scores of
the students who had been in kindergarten and first grade the preceding year.
The panel, making a rather conventional scientific decision, had judged


Table 2: Use of Longitudinal Design and Inclusion
         of Sample in Panel Substudy, by Grade of
         First Desegregation

           Percent of studies    Percent of studies
           with longitudinal     included in
Grade      design                substudy                n

KG              18%                   0%                11
1               41%                   4%                44
2               53%                  14%                36
3               63%                  13%                54
4               47%                  21%                38
5               42%                  10%                40
6               40%                   8%                25
7-12            59%                   6%                49



these studies to be of inferior quality and excluded them. While it is
true that in principle a cohort comparison is inferior to a longitudinal
experimental or quasi-experimental design, this is precisely an example of
the situation where there are competing methodological criteria, and the
choice cannot be wisely made in advance of looking at the data. In this
case a cohort study is superior because it enables us to study students who
had begun desegregation in first grade.

Estimating the Effect of Desegregation

The nineteen studies selected by the panel of scientists show an overall
effect of desegregation on achievement which is slightly more positive than
the Crain-Mahard larger sample. Whereas we find an average desegregation
effect in all 93 studies of .08 standard deviations, our estimate for the
18 of our studies selected by the panel is significantly higher: .16. This
is likely the result of discarding non-longitudinal studies. If desegregation
has a positive effect, then it follows, as Wortman notes, that accurately
done desegregation studies will show a positive effect and the panel's
exclusion of technically inferior studies should produce a higher estimate
of the effect of desegregation than our strategy of including every study
regardless of quality. We arrive at this same conclusion in a different way.
By coding the different types of research design as a variable for each
study we show that technically better research designs are correlated
with more positive effects of desegregation. As Table 3 indicates, studies
in which the performance of blacks in desegregated schools is compared to
the performance of whites, or the performance of the test-maker's norming
sample, often conclude that desegregation has failed to improve black
achievement.



Table 3: Direction and Size of Treatment Effect,
         by Type of Control Group

                        Direction of effect (%)        Effect size
Design                   +     0     -     (n)          d      (n)

1. randomized           86     5    10    (21)        .235    (15)
2. longitudinal         55    20    25   (141)        .083   (116)
3. cross-sectional      62    13    26    (39)        .130    (34)
4. cohort               53    16    31    (64)        .084    (53)
5. white controls       33     8    58    (12)        .058    (12)
6. norm controls        34    11    54    (44)       -.030    (39)

total sample            54    16    30   (321)        .080   (269)






On the other hand, studies which compare desegregated blacks to segregated
blacks, either in a "cohort" design (the segregated blacks are the students
in that same grade in the years before desegregation), a "cross-sectional"
design (with no pretest), or a longitudinal design, are twice as likely to
show positive as negative results; and randomized experiments show positive
results eight or nine times as often as negative results.
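
These ratios can be checked against Table 3. The sketch below assumes that
the first percentage column of the table gives positive results and the
third gives negative results, which is the reading that matches the ratios
stated in the text; the dictionary layout is my own.

```python
# Percent of samples with positive and negative results, from Table 3.
# Assumption: column 1 is "positive" and column 3 is "negative".
table3 = {
    "randomized":      {"positive": 86, "negative": 10},
    "longitudinal":    {"positive": 55, "negative": 25},
    "cross-sectional": {"positive": 62, "negative": 26},
    "cohort":          {"positive": 53, "negative": 31},
    "white controls":  {"positive": 33, "negative": 58},
    "norm controls":   {"positive": 34, "negative": 54},
}

# Ratio of positive to negative results for each type of control group.
ratios = {design: row["positive"] / row["negative"]
          for design, row in table3.items()}

for design, ratio in ratios.items():
    print(f"{design:16s} positive:negative = {ratio:.1f}")
```

Designs that compare desegregated blacks with segregated blacks come out
at roughly two to one, randomized experiments at between eight and nine to
one, and the two weakest designs fall below one.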

The problem with the research panel's approach is that by excluding
supposedly inferior studies by one criterion, they have managed to exclude
most of the experiments and all of the studies (except for Carrigan) in
which students were desegregated at kindergarten or first grade. Figure 1
shows a plot of the effect sizes estimated by Mahard and Crain for 28 samples
of students in the eighteen evaluations selected by the panel. This is
shown as a heavy line, which changes to a dashed line where it joins dots
based only on one or two samples of students.

The effect sizes for the entire group of 295 samples in the 93 studies
we reviewed are shown as a light solid line. In grades 2 through 5 (where
the bulk of the samples studied by the panel began desegregation) our
estimates of effect size for the panel's studies are considerably higher
than our estimate for the larger set of studies. The graph also shows,
using the letters A and S, the estimates made by
Armor and Stephen. In the range from second grade through fifth, their
estimates are also generally higher than our estimates for our larger
sample. Thus, we see again that the more selective sample shows higher
estimates, presumably because it has discarded the very weak designs which
are biased toward underestimating the effects of desegregation. At the
same time, the other point of this graph is that there are no data
points in the panel's nineteen studies for kindergarten and only 1 data
point for first grade. (The one first-grade datum is regrettably a
rather untrustworthy estimate by Carrigan, which uses a 50% black school
as its control group.) Also shown on the graph is a circle located above
first grade, at approximately +.30 standard deviations, indicating the
estimated effect size predicted by our regression equation for a typical
study of students desegregated at first grade using a randomized experimental
design. If one were willing to assume that Armor's and Stephen's data
supported the early grade effect, an extrapolation down to grade one from
their data would seem consistent with this estimate. Unfortunately, given
the relatively small number of cases and the rather ragged pattern in the data,
it is difficult to say whether either Stephen's or Armor's calculations
support the hypothesis that there are stronger effects at lower grade levels.
The problem is again made more difficult by the prior selection of
studies, which has reduced the number of cases so greatly that it is difficult
to compute reliable correlations within the data. The best data on the
question is the Crain and Mahard analysis. Table 4 presents that data, and
shows a quite strong pattern. Of 55 studies of students desegregated in
kindergarten or first grade, 45 (82%) show a positive desegregation effect.
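
The pooled figure for kindergarten and first grade follows from the first
two rows of Table 4. The sketch below uses the percent positive and the
sample count for each row as they appear in the table; because the
percentages are rounded, the reconstructed count is approximate.

```python
# (percent positive, number of samples) for the two earliest grades,
# as given in Table 4.
rows = {"KG": (100, 11), "1": (77, 44)}

positive = sum(pct / 100 * n for pct, n in rows.values())
total = sum(n for _, n in rows.values())
share = positive / total

print(f"samples: {total}")
print(f"positive (approx.): {positive:.1f}")
print(f"share positive: {share:.0%}")
```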

Another way to think of the difference between the small and large
meta-analyses is to say that one does the selection at the beginning of the
project to narrow the focus upon the most interesting cases while the other
does that selection at the end. In the analysis which Mahard and I did,
I identified 20 studies as being the best. Since this selection was based

Table 4: Direction and Size of Treatment Effect,
         by Grade at Initial Desegregation

Grade at            Direction of Effect (%)        Effect Size
desegregation:       +     0     -     (n)          d      (n)

KG                 100     0     0    (11)        .439    (10)
1                   77   ...    16    (44)        .203    (40)
2                   56     8    36    (36)        ...     (32)
3                   50    26    24    (54)        ...     (46)
4                   53   ...    26    (38)        .073    (32)
5                   44   ...    49     ...        .016    (33)
6                   52   ...    40    (25)        .090    (21)
7-9                 56    16    28    (25)        .011    (22)
10-12               48    22    30    (23)        .005    (17)

total sample        56    14    29   (295)        .079   (253)

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upon; the empirlcel findings of the analysiB; its" main considerakon vas 
that the students being studied in each case had to have been desegregated 
£t kindergarten or grade one. Beyond that, we required that there be a 
Control group of cagregated black students but our requirements fpr 
ithodology and the amount of material reported by the authors were more 
generous than the panel's. Whether our group of 20 is superior to the group 
of .19 selected by the panel is a matter fpr the, reader to decide, of course. 

The 20 "best" studies

Five of the 20 studies use a randomized experimental design.
Stanley Zdep (1971) of ETS carried out an evaluation of a city-to-suburban
voluntary transfer plan from Newark, NJ to a suburb, Verona. Verona apparently
agreed to accept 38 students, and the city held a lottery among all applicants.
Zdep then used a random selection from the unchosen volunteers as his
control group. He limited his analysis to students in first and second grade.
The first graders were pretested with a readiness test and posttested with
the Cooperative Primary Test. On the pretest, the control group tested about
.1 standard deviations above the students being transported to the suburbs;
on the posttest the bussed students were 9.8 answers higher than the control
group on a test on which the bussed students had a standard deviation of 5.4
and the control group a standard deviation of 3.8. In math, the posttest
scores favored the bussed students by [figure illegible] (control group
standard deviation 6.3), and on a subtest called listening they favored
the bussed students by 6.0 points (control group standard deviation 5.7).
Averaging the three yields an effect size of 1.60. This study was not
included in the panel's 19 studies, although Zdep's analysis of second grade
students was included. Presumably, the first grade data was dropped because
different tests were used for the pretest and posttest. Given that the
difference on the readiness test between the two groups was .11, favoring
the control group, and most importantly that the students were selected by
random assignment, the requirement that the tests be identical seems overly
strict. The main problem with the Zdep analysis is that there are only 13
transported students and a control group of 11 in the first grade. (Even
with the small sample size there is no problem with significance. The
reading test differences yield a t of about 10, for example.)
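
The conversion from raw score gaps to effect sizes in the Zdep example can
be sketched as follows. This is a minimal illustration, not the procedure
documented in the review: it assumes each gap is divided by the control
group's standard deviation, and the function name `standardized_difference`
is hypothetical. The math subtest is left out because its raw gap cannot be
read from the text.

```python
def standardized_difference(raw_gap, control_sd):
    """Express a raw posttest gap in control-group standard deviation units.

    Assumption: the control-group SD is the standardizer. A pooled SD
    would give a different (here smaller) value for the reading subtest.
    """
    return raw_gap / control_sd


# Figures quoted in the text for Zdep's first graders.
reading_d = standardized_difference(9.8, 3.8)    # reading: 9.8 answers, SD 3.8
listening_d = standardized_difference(6.0, 5.7)  # listening: 6.0 points, SD 5.7

print(f"reading   d = {reading_d:.2f}")
print(f"listening d = {listening_d:.2f}")
```

Under this assumption the reading gap alone is larger than two and a half
control-group standard deviations, which is why the small sample size does
not threaten significance here.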

Bruce Wood (1968) wrote his doctoral dissertation on the Project Concern
voluntary city-to-suburb program in Hartford, CT. He analyzed changes in
IQ scores. Two hundred and sixty-six students in grades kindergarten
through five were randomly selected, and a control group of 303 students was
selected, also randomly. At the pretest the control group scored .6 IQ points
higher than the experimental group. In the analysis he divided the group by
grade level, combining kindergarten and first grade students, and carried out
an analysis of covariance. He does not report actual raw means, but the
obtained F of [value illegible] suggests that there must have been a
difference of 1/3 standard deviations favoring the experimental group.

Thomas Mahan [date illegible] was director of the Hartford Project Concern
at that time, and conducted his own evaluation. He used data from the
second year of the project, so that presumably his results are more
biased by attrition from the original random treatment and control groups
than are Wood's. For the second year of the project, he reports a
9 point increase in IQ for the treatment group who entered the program in
kindergarten and an average gain of 2.6 points for those entering the
program in the first grade, compared to control group increases of 3 and 2
points respectively. There are also large differences favoring the
treatment group for students who entered the program in grades 2 and 3, and
negative treatment effects for students entering the program in grades
4 and 5. Mahan also reports the results of achievement testing using the
Metropolitan Readiness Test, which showed some significant differences for
the kindergarten group favoring the bussed students, and also some results
from the Primary Mental Abilities Test, which showed results for both
kindergarten and first grade students favoring the experimental group.

Project Concern operated in several cities in Connecticut, and
Joseph Samuels wrote a dissertation (1971) evaluating the New Haven program.
He compared 37 students transferred to the suburbs at kindergarten to
a control group of 50 students. There are possible biases here, in that
Samuels's transferred students were apparently screened after being randomly
selected, to drop students who "had medical or psychological reasons
precluding their involvement..." He does not say how many students were
omitted in this way. In addition, the control group was limited to students
who remained in the same school for [number illegible] years, which
presumably would bias the control group upward. If there are differences
between the two groups, they do not appear on the Monroe Reading Aptitude
Test administered to the two groups while in kindergarten; the experimental
group tested only .03 standard deviations higher. Two years later, the
treatment group tested 5.5 units higher on a reading test with a standard
deviation of 12. They also tested 5.6 units above a group of students in a
compensatory education program in the city, both differences being
significant. The Project Concern students did not test higher than the
control group in either word analysis or mathematics; they were about .25
standard deviations lower on both tests.
Meanwhile, the Rochester city schools carried out a similar city-to-suburb
program (Rock, et al., 1968). In each of three years 25 experimental
subjects were selected and allowed to transfer to the suburbs while 25 others
were held as a control group in the central city. The first experimental
group scored below the control group on the pretest (the Metropolitan
Readiness Test). At the end of the first year, the treatment students did
not score higher on the Metropolitan Achievement Test, but did score one-half
year ahead of the control group on the SRA battery. The second experimental
group also scored below their control group on the Readiness Test, but after
one year scored about three months ahead of the control group. At the end of
one year the third experimental group did not score above the control group
in reading but did score 6 months ahead of the control group in math. In
that year, the treatment group was slightly superior to the control group
on the pretest, which was the New York State Readiness Test, so this result
is questionable.

None of these five experimental studies was selected by the panel.
Usually the reason is that the pretest and posttest were not the same.
It is nearly impossible to design a study with identical tests covering
the kindergarten-first grade range, since the students cannot read at the
beginning of that period. Tests are notoriously unreliable for students at
this age. In addition, all five of the experimental designs used analysis
of covariance models, and relatively little information was provided with
which to compute effect sizes. Finally, all five studies have problems
with attrition. It is doubtful that the attrition problems are more severe
in these studies than they are in the longitudinal studies used by the
panel, but these studies are usually more detailed in describing attrition,
making it harder to overlook a problem which is in fact present in the
majority of longitudinal studies of education. In general we do not think
that these studies should be considered inferior to those chosen by the panel.

There are 8 other studies which use what we call "cohort" comparisons
(and which others often call "historical control groups"). These studies
compared scores of desegregated students in a particular grade to the scores
that blacks made in the same grade before desegregation occurred. This
kind of design is the only way to study desegregation in a community where
all schools have been desegregated, since [illegible] black students
[illegible] of years which would enable one to conduct an interrupted
time-series analysis. For example, the Nashville-Davidson County public
schools (1979) [illegible] year period from 1970, when the desegregation plan
was adopted, to 1978. [Illegible] test scores show a considerable gain over
that period, ranging from [illegible] to [illegible] standard deviations. Of
course, the problem is that we cannot attribute this to desegregation; it may
be due to other changes in testing or educational practice in the city.

One wonders whether a school district would be anxious to publish
results showing negative effects. Perhaps many other school districts
have the same sort of data that Nashville has but have not released it to
interested researchers because it shows declines in achievement. But
one example which works in the opposite direction is from Pasadena, whose
school board has been vehemently opposed to mandatory desegregation, and
released a lengthy report by Harold Kurtz (1975) showing the disastrous
educational consequences of desegregation there. In 15 tests of students
who were desegregated in grades 2 through 12, scores were lower after
desegregation 14 times. But there were very large achievement increases
for students in kindergarten and first grade, averaging .36 standard
deviations. Thus while test scores dropped for black students throughout
the district during the period of time after desegregation, test scores
of the very youngest students went up. This could be a peculiarity of the
testing procedure used with the youngest students, of course.

Cohort analysis is necessary when a district is totally desegregated.
Total desegregation in the north came first to university communities, the
largest of which was Berkeley, which desegregated in 1968. Test scores
dropped that spring, about .04 standard deviations in reading for first
graders. By 1970, second graders were reading about .16 standard deviations
above the second graders of 1968. Thus one report (Dambacher, 1971) shows
essentially no change in test scores using the first year of desegregation,
while a second paper (Lunemann, 1973) shows a positive desegregation effect.
(In this analysis black and "other," presumably Hispanics who did not
consider themselves whites, were combined in one year and separated in
others. The percentage of "other" students in the district changed radically,
however, suggesting that these ethnic classifications were unstable. We have
combined "others" with blacks for all years in order to avoid this problem.)



Another university town which developed a desegregation plan was
Evanston. Jayjia Hsia of ETS (1971) carried out a lengthy evaluation, and
found that in the fall of the third grade, two years after desegregation,
students were testing .01 standard deviations below students two years
earlier. She found gains in only 3 out of 9 tests in the upper grades over
the first two years.

Another school district which reported achievement test scores for the
year after desegregation in comparison to the year before was Clark County
(Las Vegas), Nevada. Test scores for black students were up .1 years.

In one southern district, George Chenault (1976) found that students
who were desegregated in kindergarten scored .3 years higher in the fourth
grade compared to students five years earlier.

Finally, we have constructed a cohort analysis from the data provided
by Patricia Carrigan (1969). The panel treated Carrigan as a longitudinal
study, but the "segregated" control school is 50% black, which is
desegregated by most people's criteria. We ignored the data for the control
school and instead compared the performance of the desegregated black
students to black students at the sending school prior to desegregation.
We found the integrated students scoring .05 standard deviations higher.

All the cohort studies are subject to alternative interpretations:
change in curricula, in type of test, or in test administration could all
affect test scores. On the other hand, cohort studies have the advantage
of having relatively large sample sizes. They are also not likely to be
affected by complicated statistical procedures which sometimes do more
harm than good. Of eight studies of students desegregated at kindergarten
or first grade, we found gains in 6, the exceptions being Hsia's Evanston
study and Dambacher's Berkeley study, whose conclusions were reversed the
following year by Lunemann.*

The final group of studies of students desegregated at first grade or
kindergarten are longitudinal studies with non-random assignment. These are
generally the most difficult studies to draw conclusions from, because the
inability to use accurate pretests with very young children makes statistical
matching extremely difficult. In the two best studies, by Louis Anderson
(1966) of Nashville's early freedom-of-choice plan, and Louise Moore (1971)
of DeKalb County, GA, the full data was provided, making it possible for
Mahard and me to reanalyze the data. In both cases we examined student growth
during the middle of elementary school, comparing growth rates for students
who had experienced desegregation from kindergarten or first grade to other
students in segregated schools in earlier years. One study showed a sizeable
increase in the rate of learning while the other study showed a loss after
desegregation. We were reluctant to take either study seriously, since we
are not sure how to relate these two studies of growth rates several years
after desegregation to all the other studies, which measure growth immediately
following desegregation. Five other studies pretested students at kindergarten
or first grade and posttested them one or two years later. These are
usually very brief reports of studies with relatively small sample sizes.

Orrin Bowman's (1973) dissertation evaluates a voluntary plan in
Rochester, NY. Two experimental groups exceed the controls (both a regular
class and an "enriched" class) by .15 and .32 standard deviations on a
readiness test at grade 1; at grade 3 they exceed the controls on an
achievement battery by .90 and .88 standard deviations. Bowman's analysis
of covariance shows net effects of .75 and .7[digit illegible]; using the
panel's procedure, I get effects of .72 and .66. There are only 19 and 17
treatment subjects. Ann Danahy (1971) compared 41 volunteers for
desegregation to a control group randomly chosen from a segregated school.
Little raw data is provided. The author uses regression to control on the
seemingly large pretest differences on the Metropolitan Readiness Test, and
obtains non-significant positive treatment effects. The technique used
overestimates treatment effects, however.

* [beginning of footnote illegible] 1980. We received it too late to include
in our review.

Robert Frary and Thomas Goolsby (1979) compared 32 desegregated first
graders to 77 in segregated schools, using the Metropolitan Readiness Test
as a pretest and the Metropolitan Achievement Test administered at the end of
first grade as a posttest. There were large differences (on the order of
.7 years) favoring the desegregated students. The pretest data was used
to trichotomize the sample before comparing posttest means within each group.
Elmer Lemke (1979), studying Peoria, Illinois, studied 150 desegregated
and 60 segregated black students five years after desegregation began.
He used the Metropolitan Readiness Test and the Iowa Test of Basic Skills,
and found only one significant positive effect and no significant negative
effects out of a possible ten differences; we judged the overall effect
as zero. T. G. Wolman (1964) studied New Rochelle, using the MAT to pretest
and posttest desegregated and segregated elementary school students and
the Metropolitan Readiness Test to pretest and posttest kindergarten
students. He reports no significant desegregation effects on the MAT,
but significant gains for kindergarten students. He reports none of the
data, however. Of these five studies, only Bowman is included in the
panel's group of 19. The other 4 studies were rejected either because they
used different tests for pretest and posttest or because insufficient
statistics were provided in the write-up to permit us to compute an effect
size. In my judgment none of these 5 studies should be considered of
especially good quality.

Conclusions 

It is stretching a point to argue that the twenty kindergarten-first
grade studies are the "best" studies, given their wide range of quality.
They were not selected as models of research, but because they gave what
we thought were the least biased estimates of the effect of desegregation.
We do believe that several of these studies are better than the average of
the panel's selections, which were supposedly intended to be the "best," but
we are not conducting a prize competition for best dissertation* of the
last two decades. We are trying to estimate the effects of desegregation.

Our 20 "best" studies include 5 analyses of four different experimental
designs, all showing relatively large positive treatment effects (the
median treatment effect size of these experiments is .34 standard deviations).
We also found 8 "historical control groups" studies, six of which showed a
positive treatment effect and only 1 a negative effect; the median effect
size was .12 standard deviations. Finally, we found 7 longitudinal studies,
five of which showed positive treatment effects and only one a negative
effect, with a median effect size of .24. Consistent positive outcomes on
5 analyses of randomized experiments are impressive. While the other studies
are a good deal weaker methodologically, their results are also consistently
positive: 11 studies of 15 are positive and only 2 are negative. If the
principal function of selecting a superior subgroup of studies is to find
the consistency of results which is masked by error in an unselected sample
of studies, we believe we did that, and that the panel did not.

* One of the 93 studies, a dissertation by Ann Linney (1979), did win a
prize from the American Psychological Association; it was not included in
either the panel's group of 19 or our list of 20.
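
The tallies in this paragraph can be checked, and their consistency given a
rough numerical form, with a few lines of arithmetic. The sketch below is
an illustration only: the sign test is not part of the original analysis,
it drops the studies judged to show no effect, and it treats the studies as
independent, which overlapping programs (for example the two Hartford
analyses) may violate. The function name `sign_test_two_sided` is
hypothetical.

```python
from math import comb

# Counts quoted in the text for the 20 kindergarten-first grade studies.
tallies = {
    "randomized experiments": {"total": 5, "positive": 5, "negative": 0},
    "historical control groups": {"total": 8, "positive": 6, "negative": 1},
    "longitudinal": {"total": 7, "positive": 5, "negative": 1},
}

total = sum(group["total"] for group in tallies.values())
positive = sum(group["positive"] for group in tallies.values())
negative = sum(group["negative"] for group in tallies.values())


def sign_test_two_sided(pos, neg):
    """Two-sided exact sign test: chance of a split at least this lopsided
    if positive and negative outcomes were equally likely (ties dropped)."""
    n = pos + neg
    k = max(pos, neg)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = sign_test_two_sided(positive, negative)

print(f"{positive} positive, {negative} negative, of {total} studies")
print(f"two-sided sign test p = {p_value:.4f}")
```

The non-experimental groups alone give 11 positive and 2 negative of 15,
matching the figures in the text.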